With the development of modern large-scale computing technologies, machine learning techniques have been increasingly applied to large-scale healthcare datasets, such as clinical and biological data. Developing and optimizing health-related systems is undoubtedly promising, as human health always comes first. Among the many research topics in this area, we are interested in predicting 30-day hospital readmission. If we can accurately predict readmission, we can identify patients at high risk and avoid discharging them too early. As a result, patients receive more precise care, and hospitals can better manage resources, reduce waste, and save costs. Existing works have already explored this topic [1-4]. Before designing the model, extracting the most useful features is of great importance. So far, the features that have been considered include demographic characteristics such as age, gender, and race; lab results and chart events monitored in the ICU [1]; and electronic health record data such as length of stay, number of admissions, and admission type [2,3]. Machine learning techniques that have been used include random forests [1] and neural networks [4]. Our project goal is to extract relevant features from the MIMIC database and design classification models that optimize predictive performance.
References:
[1] Yaron Blinder. Predicting 30-day ICU readmissions from the MIMIC-III database. https://github.com/YaronBlinder/MIMIC-III_readmission. 2017.
[2] Oanh Kieu Nguyen et al. "Predicting all-cause readmissions using electronic health record data from the entire hospitalization: model development and comparison". In: Journal of Hospital Medicine 11.7 (2016), pp. 473-480.
[3] Frida Kareliusson, Lina De Geer, and Anna Oscarsson Tibblin. "Risk prediction of ICU readmission in a mixed surgical and medical population". In: Journal of Intensive Care 3.1 (2015), p. 30.
[4] Ricardo Bento Afonso. "Feature Extraction and Selection for Prediction of ICU Patient's Readmission Using Artificial Neural Networks". 2013.
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
import time
import math
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['figure.dpi'] = 300
conn = sqlite3.connect('C:/Users/culiubaicaiya/Downloads/bme590 course note/data/MIMIC.db')
readmit_df = pd.read_sql('''SELECT * FROM
(SELECT SUBJECT_ID, HADM_ID, ADMITTIME, DISCHTIME,
RANK() OVER(PARTITION BY SUBJECT_ID ORDER BY ADMITTIME) AS ADMITTIME_RANK
FROM ADMISSIONS) a
LEFT JOIN
(SELECT SUBJECT_ID, MAX(ADMITTIME_RANK) AS MAX_ADMITTIME_RANK FROM
(SELECT SUBJECT_ID, HADM_ID, ADMITTIME, DISCHTIME,
RANK() OVER(PARTITION BY SUBJECT_ID ORDER BY ADMITTIME) AS ADMITTIME_RANK
FROM ADMISSIONS)
GROUP BY SUBJECT_ID) b
ON a.SUBJECT_ID = b.SUBJECT_ID''', conn)
readmit_df[['ADMITTIME','DISCHTIME']] = readmit_df[['ADMITTIME','DISCHTIME']].apply(lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S'))
# shift ADMITTIME up one row so each stay is paired with the next admission time
# (relies on the query returning rows ordered by subject and admission time);
# the wrapped-around value in the last row is meaningless but gets masked below,
# since each subject's final admission is labeled 0 directly
next_admittime = readmit_df['ADMITTIME'].values[1:]
next_admittime = np.append(next_admittime, next_admittime[0])
readmit_df['NEXT_ADMITTIME'] = next_admittime
# label: -1 = undetermined, 0 = no 30-day readmission, 1 = readmitted within 30 days
readmit_df['30_READMIT'] = -1
# a subject's final admission cannot be followed by a readmission
readmit_df.loc[readmit_df['ADMITTIME_RANK'] == readmit_df['MAX_ADMITTIME_RANK'], '30_READMIT'] = 0
readmit_sub_df = readmit_df[readmit_df['30_READMIT'] == -1]
# days between discharge and the subject's next admission
interval = (readmit_sub_df['NEXT_ADMITTIME'] - readmit_sub_df['DISCHTIME']).dt.days.values
readmit_df.loc[readmit_df['30_READMIT'] == -1, 'INTERVAL'] = interval
readmit_df.loc[readmit_df['INTERVAL'] <= 30, '30_READMIT'] = 1
readmit_df.loc[readmit_df['INTERVAL'] > 30, '30_READMIT'] = 0
readmit_df.head()
readmit_label_df = readmit_df[['HADM_ID', '30_READMIT']]
readmit_label_df.head()
readmit_label_df.shape
le_df = pd.read_sql(''' select le.subject_id, le.hadm_id
, min(case when le.itemid=51006 then le.valuenum else null end) as urea_N_min
, max(case when le.itemid=51006 then le.valuenum else null end) as urea_N_max
, avg(case when le.itemid=51006 then le.valuenum else null end) as urea_N_mean
, min(case when le.itemid=51265 then le.valuenum else null end) as platelets_min
, max(case when le.itemid=51265 then le.valuenum else null end) as platelets_max
, avg(case when le.itemid=51265 then le.valuenum else null end) as platelets_mean
, max(case when le.itemid=50960 then le.valuenum else null end) as magnesium_max
, min(case when le.itemid=50862 then le.valuenum else null end) as albumin_min
, min(case when le.itemid=50893 then le.valuenum else null end) as calcium_min
from labevents le
where hadm_id is not null
group by 1,2 order by 1,2
''', conn)
le_df.head()
chev_df = pd.read_sql('''select hadm_id
, min(case when itemid in (615,618,220210,224690) and valuenum > 0 and valuenum < 70 then valuenum else null end) as RespRate_Min
, max(case when itemid in (615,618,220210,224690) and valuenum > 0 and valuenum < 70 then valuenum else null end) as RespRate_Max
, avg(case when itemid in (615,618,220210,224690) and valuenum > 0 and valuenum < 70 then valuenum else null end) as RespRate_Mean
, min(case when itemid in (807,811,1529,3745,3744,225664,220621,226537) and valuenum > 0 then valuenum else null end) as Glucose_Min
, max(case when itemid in (807,811,1529,3745,3744,225664,220621,226537) and valuenum > 0 then valuenum else null end) as Glucose_Max
, avg(case when itemid in (807,811,1529,3745,3744,225664,220621,226537) and valuenum > 0 then valuenum else null end) as Glucose_Mean
, min(case when itemid in (211,220045) and valuenum > 0 and valuenum < 300 then valuenum else null end) as HR_min
, max(case when itemid in (211,220045) and valuenum > 0 and valuenum < 300 then valuenum else null end) as HR_max
, round(cast(avg(case when itemid in (211,220045) and valuenum > 0 and valuenum < 300 then valuenum else null end) as numeric), 2) as HR_mean
, min(case when itemid in (51,442,455,6701,220179,220050) and valuenum > 0 and valuenum < 400 then valuenum else null end) as SysBP_min
, max(case when itemid in (51,442,455,6701,220179,220050) and valuenum > 0 and valuenum < 400 then valuenum else null end) as SysBP_max
, round(cast(avg(case when itemid in (51,442,455,6701,220179,220050) and valuenum > 0 and valuenum < 400 then valuenum else null end) as numeric), 2) as SysBP_mean
, min(case when itemid in (8368,8440,8441,8555,220180,220051) and valuenum > 0 and valuenum < 300 then valuenum else null end) as DiasBP_min
, max(case when itemid in (8368,8440,8441,8555,220180,220051) and valuenum > 0 and valuenum < 300 then valuenum else null end) as DiasBP_max
, round(cast(avg(case when itemid in (8368,8440,8441,8555,220180,220051) and valuenum > 0 and valuenum < 300 then valuenum else null end) as numeric), 2) as DiasBP_mean
, min(case when itemid in (223761,678) and valuenum > 70 and valuenum < 120 then (valuenum-32)/1.8
when itemid in (223762,676) and valuenum > 10 and valuenum < 50 then valuenum else null end) as temp_min
, max(case when itemid in (223761,678) and valuenum > 70 and valuenum < 120 then (valuenum-32)/1.8
when itemid in (223762,676) and valuenum > 10 and valuenum < 50 then valuenum else null end) as temp_max
, round(cast(avg(case when itemid in (223761,678) and valuenum > 70 and valuenum < 120 then (valuenum-32)/1.8
when itemid in (223762,676) and valuenum > 10 and valuenum < 50 then valuenum else null end) as numeric), 2) as temp_mean
from chartevents
where itemid in
(
615,618,220210,224690,                        -- RespRate
807,811,1529,3745,3744,225664,220621,226537,  -- Glucose
211,220045,                                   -- HR
51,442,455,6701,220179,220050,                -- SysBP
8368,8440,8441,8555,220180,220051,            -- DiasBP
223761,678,223762,676                         -- Temp
)
and hadm_id is not null
group by 1
''', conn)
chev_df.head()
chev_df = chev_df.astype('float64')
opev_df = pd.read_sql('''select hadm_id
, min(value) as urine_min
, max(value) as urine_max
, round(cast(avg(value) as numeric)) as urine_mean
from outputevents
where itemid in (40055,226559)
and hadm_id is not null
group by 1
''', conn)
opev_df.head()
age_df = pd.read_sql(''' SELECT a.HADM_ID, ADMITTIME, b.DOB
FROM ADMISSIONS a
INNER JOIN PATIENTS b on a.SUBJECT_ID = b.SUBJECT_ID
''', conn)
age_df[['ADMITTIME','DOB']] =age_df[['ADMITTIME','DOB']].apply(lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S'))
# keep admissions with a plausible age: drop newborns (age 0) and implausible
# ages (MIMIC shifts DOB for patients over 89, making them appear ~300 years old)
idx_list = []
for idx in range(age_df.shape[0]):
    age = (age_df['ADMITTIME'][idx] - age_df['DOB'][idx]).days // 365
    if age != 0 and age <= 120:
        idx_list.append(idx)
age_df = age_df.loc[idx_list, :]
age_df['AGE'] = (age_df['ADMITTIME'] - age_df['DOB']).dt.days.values // 365
age_df = age_df.drop(['ADMITTIME', 'DOB'], axis=1)
age_df.head()
numfeat_df = pd.merge(le_df, opev_df, left_on=['HADM_ID'], right_on=['HADM_ID'], how='inner')
numfeat_df = pd.merge(numfeat_df, age_df, left_on=['HADM_ID'], right_on=['HADM_ID'], how='inner')
numfeat_df = pd.merge(numfeat_df, chev_df, left_on=['HADM_ID'], right_on=['HADM_ID'], how='inner')
numfeat_df = numfeat_df.drop(['SUBJECT_ID'], axis=1)
# fill missing values with column mean
numfeat_mean = numfeat_df.mean().to_dict()
numfeat_df = numfeat_df.fillna(value=numfeat_mean)
numfeat_df.head()
catfeat_df = pd.read_sql(''' SELECT a.HADM_ID, INSURANCE, MARITAL_STATUS, b.GENDER, c.FIRST_CAREUNIT, LAST_CAREUNIT
FROM ADMISSIONS a
INNER JOIN PATIENTS b on a.SUBJECT_ID = b.SUBJECT_ID
INNER JOIN ICUSTAYS c on a.HADM_ID = c.HADM_ID
''', conn)
catfeat_df = pd.get_dummies(catfeat_df)
catfeat_df.head()
des_mat = pd.merge(readmit_label_df, catfeat_df, left_on=['HADM_ID'], right_on=['HADM_ID'], how='inner')
des_mat = pd.merge(des_mat, numfeat_df, left_on=['HADM_ID'], right_on=['HADM_ID'], how='inner')
des_mat.head()
des_mat.columns
des_mat.shape
y = des_mat['30_READMIT'].values
len(y)
len(y[y==1])
X = des_mat.drop(['HADM_ID', '30_READMIT'], axis=1).values
X.shape
# the first 26 columns are the one-hot encoded categorical features;
# the remaining columns are numerical features
X_catfeat = X[:, :26]
X_catfeat
X_numfeat = X[:, 26:]
X_numfeat
We visualized the readmission proportion of the data. The classes are clearly highly imbalanced. We also drew scatter plots of the numerical features; the points are hardly separable, which increases the difficulty of model design.
H1_num = len(y[y==1])
H0_num = len(y[y==0])
total_num = len(y)
H1_num
labels = '30Days Readmission-Y', '30Days Readmission-N'
sizes = [H1_num/total_num*100, H0_num/total_num*100]
colors = ['pink', 'skyblue']
explode = (0.05, 0)
fig = plt.figure(figsize=(6,6))
plt.pie(sizes, explode=explode, colors=colors, autopct='%1.1f%%', startangle=90)
plt.legend(labels, fontsize='small', loc='center left', bbox_to_anchor=(1, 0, 0.5, 1))
full_index = np.random.permutation(np.arange(des_mat.shape[0]))
sub_index = full_index[:500]
draw_df = des_mat.iloc[sub_index, :]
draw_df = draw_df.drop(['HADM_ID'], axis=1)
draw_df.loc[draw_df['30_READMIT']==0, '30_READMIT'] = 'N'
draw_df.loc[draw_df['30_READMIT']==1, '30_READMIT'] = 'Y'
draw_num_df = pd.concat((draw_df['30_READMIT'], draw_df.iloc[:, 27:33]), axis=1)
num_feat_plot = sns.pairplot(draw_num_df, hue='30_READMIT', plot_kws=dict(s=30))
We first randomly assigned the data instances to different folds to prepare for cross-validation. For preprocessing, we employed undersampling and oversampling to mitigate the class imbalance problem. For the numerical ICU data, we used z-scoring to normalize the features. We then tested logistic regression and a decision tree. To improve performance, we applied ensemble methods, including bagged logistic regression and random forest. Finally, we evaluated cross-validation performance and tuned hyperparameters to optimize the results.
from IPython.display import Image
PATH = "./pic/"
Image(filename = PATH + "pipeline.jpg", width=800, height=400)
ros = RandomOverSampler(random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=False)  # random_state has no effect (and raises an error in newer sklearn) when shuffle=False
y_true = []
y_pred = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # oversample the minority class on the training fold only
    X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
    Xcat_train, Xnum_train = X_train_ros[:, :26], X_train_ros[:, 26:]
    Xcat_test, Xnum_test = X_test[:, :26], X_test[:, 26:]
    # standardization: fit on the training fold only, then apply to the test fold
    scaler = StandardScaler()
    Xnum_train_std = scaler.fit_transform(Xnum_train)
    Xnum_test_std = scaler.transform(Xnum_test)
    X_train_new = np.concatenate((Xcat_train, Xnum_train_std), axis=1)
    X_test_new = np.concatenate((Xcat_test, Xnum_test_std), axis=1)
    lr_clf = LogisticRegression(random_state=0, solver='liblinear')
    lr_clf.fit(X_train_new, y_train_ros)
    y_pred_sub = lr_clf.predict(X_test_new)
    y_true = np.append(y_true, y_test)
    y_pred = np.append(y_pred, y_pred_sub)
lr_os_conf = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = lr_os_conf.ravel()
lr_os_sen = tp/(tp+fn)
lr_os_spe = tn/(tn+fp)
lr_os_prec = tp/(tp+fp)
lr_os_acc = (tp+tn)/(tp+tn+fp+fn)
print ('Logistic Regression, Over_Sampling')
print()
print('confusion matrix:')
print(lr_os_conf)
print('sensitivity score:{:.4}'.format(lr_os_sen))
print('specificity score:{:.4}'.format(lr_os_spe))
print('precision score:{:.4}'.format(lr_os_prec))
print('accuracy score:{:.4}'.format(lr_os_acc))
rus = RandomUnderSampler(random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=False)  # random_state has no effect (and raises an error in newer sklearn) when shuffle=False
y_true = []
y_pred = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # undersample the majority class on the training fold only
    X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
    Xcat_train, Xnum_train = X_train_rus[:, :26], X_train_rus[:, 26:]
    Xcat_test, Xnum_test = X_test[:, :26], X_test[:, 26:]
    # standardization: fit on the training fold only, then apply to the test fold
    scaler = StandardScaler()
    Xnum_train_std = scaler.fit_transform(Xnum_train)
    Xnum_test_std = scaler.transform(Xnum_test)
    X_train_new = np.concatenate((Xcat_train, Xnum_train_std), axis=1)
    X_test_new = np.concatenate((Xcat_test, Xnum_test_std), axis=1)
    lr_clf = LogisticRegression(random_state=0, solver='liblinear')
    lr_clf.fit(X_train_new, y_train_rus)
    y_pred_sub = lr_clf.predict(X_test_new)
    y_true = np.append(y_true, y_test)
    y_pred = np.append(y_pred, y_pred_sub)
lr_us_conf = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = lr_us_conf.ravel()
lr_us_sen = tp/(tp+fn)
lr_us_spe = tn/(tn+fp)
lr_us_prec = tp/(tp+fp)
lr_us_acc = (tp+tn)/(tp+tn+fp+fn)
print ('Logistic Regression, Under_Sampling')
print()
print('confusion matrix:')
print(lr_us_conf)
print('sensitivity score:{:.4}'.format(lr_us_sen))
print('specificity score:{:.4}'.format(lr_us_spe))
print('precision score:{:.4}'.format(lr_us_prec))
print('accuracy score:{:.4}'.format(lr_us_acc))
ros = RandomOverSampler(random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=False)  # random_state has no effect (and raises an error in newer sklearn) when shuffle=False
y_true = []
y_pred = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
    Xcat_train, Xnum_train = X_train_ros[:, :26], X_train_ros[:, 26:]
    Xcat_test, Xnum_test = X_test[:, :26], X_test[:, 26:]
    # standardization: fit on the training fold only, then apply to the test fold
    scaler = StandardScaler()
    Xnum_train_std = scaler.fit_transform(Xnum_train)
    Xnum_test_std = scaler.transform(Xnum_test)
    X_train_new = np.concatenate((Xcat_train, Xnum_train_std), axis=1)
    X_test_new = np.concatenate((Xcat_test, Xnum_test_std), axis=1)
    base_estimator = LogisticRegression(random_state=0, solver='liblinear')
    # each bagged model sees a random subset of sqrt(p) features
    max_features = int(math.sqrt(X_train.shape[1]))
    blr_clf = BaggingClassifier(base_estimator=base_estimator, n_estimators=100, max_features=max_features, random_state=0)
    blr_clf.fit(X_train_new, y_train_ros)
    y_pred_sub = blr_clf.predict(X_test_new)
    y_true = np.append(y_true, y_test)
    y_pred = np.append(y_pred, y_pred_sub)
blr_os_conf = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = blr_os_conf.ravel()
blr_os_sen = tp/(tp+fn)
blr_os_spe = tn/(tn+fp)
blr_os_prec = tp/(tp+fp)
blr_os_acc = (tp+tn)/(tp+tn+fp+fn)
print ('Bagging Logistic Regression, Over_Sampling')
print()
print('confusion matrix:')
print(blr_os_conf)
print('sensitivity score:{:.4}'.format(blr_os_sen))
print('specificity score:{:.4}'.format(blr_os_spe))
print('precision score:{:.4}'.format(blr_os_prec))
print('accuracy score:{:.4}'.format(blr_os_acc))
rus = RandomUnderSampler(random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=False)  # random_state has no effect (and raises an error in newer sklearn) when shuffle=False
y_true = []
y_pred = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
    Xcat_train, Xnum_train = X_train_rus[:, :26], X_train_rus[:, 26:]
    Xcat_test, Xnum_test = X_test[:, :26], X_test[:, 26:]
    # standardization: fit on the training fold only, then apply to the test fold
    scaler = StandardScaler()
    Xnum_train_std = scaler.fit_transform(Xnum_train)
    Xnum_test_std = scaler.transform(Xnum_test)
    X_train_new = np.concatenate((Xcat_train, Xnum_train_std), axis=1)
    X_test_new = np.concatenate((Xcat_test, Xnum_test_std), axis=1)
    base_estimator = LogisticRegression(penalty='l1', random_state=0, solver='liblinear')
    # each bagged model sees a random subset of sqrt(p) features
    max_features = int(math.sqrt(X_train.shape[1]))
    blr_clf = BaggingClassifier(base_estimator=base_estimator, n_estimators=100, max_features=max_features, random_state=0)
    blr_clf.fit(X_train_new, y_train_rus)
    y_pred_sub = blr_clf.predict(X_test_new)
    y_true = np.append(y_true, y_test)
    y_pred = np.append(y_pred, y_pred_sub)
blr_us_conf = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = blr_us_conf.ravel()
blr_us_sen = tp/(tp+fn)
blr_us_spe = tn/(tn+fp)
blr_us_prec = tp/(tp+fp)
blr_us_acc = (tp+tn)/(tp+tn+fp+fn)
print ('Bagging Logistic Regression, Under_Sampling')
print()
print('confusion matrix:')
print(blr_us_conf)
print('sensitivity score:{:.4}'.format(blr_us_sen))
print('specificity score:{:.4}'.format(blr_us_spe))
print('precision score:{:.4}'.format(blr_us_prec))
print('accuracy score:{:.4}'.format(blr_us_acc))
skf = StratifiedKFold(n_splits=5, shuffle=False)  # random_state has no effect (and raises an error in newer sklearn) when shuffle=False
y_true = []
y_pred = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    Xcat_train, Xnum_train = X_train[:, :26], X_train[:, 26:]
    Xcat_test, Xnum_test = X_test[:, :26], X_test[:, 26:]
    # standardization: fit on the training fold only, then apply to the test fold
    scaler = StandardScaler()
    Xnum_train_std = scaler.fit_transform(Xnum_train)
    Xnum_test_std = scaler.transform(Xnum_test)
    X_train_new = np.concatenate((Xcat_train, Xnum_train_std), axis=1)
    X_test_new = np.concatenate((Xcat_test, Xnum_test_std), axis=1)
    # handle imbalance via class weights instead of resampling
    dt_clf = DecisionTreeClassifier(max_depth=7, random_state=0, class_weight='balanced')
    dt_clf.fit(X_train_new, y_train)
    y_pred_sub = dt_clf.predict(X_test_new)
    y_true = np.append(y_true, y_test)
    y_pred = np.append(y_pred, y_pred_sub)
dt_conf = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = dt_conf.ravel()
dt_sen = tp/(tp+fn)
dt_spe = tn/(tn+fp)
dt_prec = tp/(tp+fp)
dt_acc = (tp+tn)/(tp+tn+fp+fn)
print ('Decision Tree')
print()
print('confusion matrix:')
print(dt_conf)
print('sensitivity score:{:.4}'.format(dt_sen))
print('specificity score:{:.4}'.format(dt_spe))
print('precision score:{:.4}'.format(dt_prec))
print('accuracy score:{:.4}'.format(dt_acc))
skf = StratifiedKFold(n_splits=5, shuffle=False)  # random_state has no effect (and raises an error in newer sklearn) when shuffle=False
y_true = []
y_pred = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    Xcat_train, Xnum_train = X_train[:, :26], X_train[:, 26:]
    Xcat_test, Xnum_test = X_test[:, :26], X_test[:, 26:]
    # standardization: fit on the training fold only, then apply to the test fold
    scaler = StandardScaler()
    Xnum_train_std = scaler.fit_transform(Xnum_train)
    Xnum_test_std = scaler.transform(Xnum_test)
    X_train_new = np.concatenate((Xcat_train, Xnum_train_std), axis=1)
    X_test_new = np.concatenate((Xcat_test, Xnum_test_std), axis=1)
    # handle imbalance via class weights instead of resampling
    rf_clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, class_weight='balanced', n_jobs=-1)
    rf_clf.fit(X_train_new, y_train)
    y_pred_sub = rf_clf.predict(X_test_new)
    y_true = np.append(y_true, y_test)
    y_pred = np.append(y_pred, y_pred_sub)
rf_conf = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = rf_conf.ravel()
rf_sen = tp/(tp+fn)
rf_spe = tn/(tn+fp)
rf_prec = tp/(tp+fp)
rf_acc = (tp+tn)/(tp+tn+fp+fn)
print ('Random Forest')
print()
print('confusion matrix:')
print(rf_conf)
print('sensitivity score:{:.4}'.format(rf_sen))
print('specificity score:{:.4}'.format(rf_spe))
print('precision score:{:.4}'.format(rf_prec))
print('accuracy score:{:.4}'.format(rf_acc))
The classes are highly imbalanced, so we cannot rely on accuracy alone to evaluate model performance. We computed four performance scores: sensitivity, specificity, precision, and accuracy. We faced the sensitivity/specificity trade-off when tuning parameters. Since the goal of the project is to identify patients at high risk of readmission so they can receive special care, we emphasized sensitivity when evaluating model performance.
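All four scores are simple functions of the confusion-matrix entries; as a sketch, the per-model arithmetic repeated above could be factored into one helper (the name `scores_from_confusion` and the toy labels below are ours, not part of the project pipeline):

```python
from sklearn.metrics import confusion_matrix

def scores_from_confusion(y_true, y_pred):
    """Return sensitivity, specificity, precision, and accuracy
    from the binary confusion matrix [[tn, fp], [fn, tp]]."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        'sensitivity': tp / (tp + fn),          # recall on the readmitted class
        'specificity': tn / (tn + fp),          # recall on the non-readmitted class
        'precision':   tp / (tp + fp),
        'accuracy':    (tp + tn) / (tp + tn + fp + fn),
    }

# toy example: 3 TP, 1 FN, 2 FP, 4 TN
scores = scores_from_confusion([1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                               [1, 1, 1, 0, 1, 1, 0, 0, 0, 0])
```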
def draw_conf_mat(conf_mat, title, ax):
    conf_plot = ax.matshow(conf_mat, cmap=plt.cm.Blues)
    class_names = ['Not', 'Readmit']
    ax.set_xticklabels([''] + class_names, fontdict={'fontsize': '20'})
    ax.set_yticklabels([''] + class_names, fontdict={'fontsize': '20'})
    plt.title(title, fontsize='28')
    plt.xlabel('Prediction', fontsize='20')
    plt.ylabel('Truth', fontsize='20')
    num = conf_mat.ravel()
    plt.text(-.15, 0, str(num[0]), fontdict={'color': 'white', 'size': 28})
    plt.text(.85, 0, str(num[1]), fontdict={'color': 'white', 'size': 28})
    plt.text(-.15, 1, str(num[2]), fontdict={'size': 28})
    plt.text(.85, 1, str(num[3]), fontdict={'size': 28})
conf_mat_list = [lr_os_conf, lr_us_conf, blr_os_conf,blr_us_conf,dt_conf, rf_conf]
clf = ['Logistic Regression_OverSampling', 'Logistic Regression_UnderSampling',
'Bagging Logistic Regression_OverSampling', 'Bagging Logistic Regression_UnderSampling',
'Decision Tree', 'Random Forest']
fig = plt.figure(figsize=(30,20))
for i in range(1, 7):
    ax = fig.add_subplot(2, 3, i)
    draw_conf_mat(conf_mat_list[i-1], clf[i-1], ax)
    if i != 1 and i != 4:
        ax.set_ylabel('')
clf = ['LR_OS', 'LR_US', 'BagLR_OS', 'BagLR_US', 'DT', 'RF']
sen = [lr_os_sen, lr_us_sen, blr_os_sen, blr_us_sen, dt_sen, rf_sen]
spe = [lr_os_spe, lr_us_spe, blr_os_spe, blr_us_spe, dt_spe, rf_spe]
prec = [lr_os_prec, lr_us_prec, blr_os_prec, blr_us_prec, dt_prec, rf_prec]
acc = [lr_os_acc, lr_us_acc, blr_os_acc, blr_us_acc, dt_acc, rf_acc]
perf_dict = {'classifier': clf, 'sensitivity': sen, 'specificity':spe, 'precision': prec, 'accuracy':acc}
draw_df = pd.DataFrame(perf_dict)
draw_melt_df = pd.melt(draw_df, id_vars=['classifier'],)
fig = plt.figure(figsize=(30,30))
perf_plt = sns.catplot(x='classifier', y='value',col='variable', col_wrap=2,
kind='bar', data=draw_melt_df, palette=sns.color_palette('Paired'))
perf_plt.axes[0].set_title('Sensitivity', fontsize='20')
perf_plt.axes[1].set_title('Specificity', fontsize='20')
perf_plt.axes[2].set_title('Precision',fontsize='20')
perf_plt.axes[3].set_title('Accuracy',fontsize='20')
perf_plt.axes[2].set_xticklabels(labels= clf, rotation='vertical',fontdict={'size':20})
perf_plt.axes[3].set_xticklabels(labels= clf, rotation='vertical',fontdict={'size':20})
perf_plt.axes[2].set_xlabel('')
perf_plt.axes[3].set_xlabel('')
perf_plt.axes[0].set_ylabel('Score', fontsize='20')
perf_plt.axes[2].set_ylabel('Score', fontsize='20')
perf_plt.axes[0].set_yticklabels(labels=['0','0.1','0.2','0.3','0.4','0.5','0.6','0.7'], fontdict={'size':20})
perf_plt.axes[2].set_yticklabels(labels=['0','0.1','0.2','0.3','0.4','0.5','0.6','0.7'], fontdict={'size':20})
for index, row in draw_df.iterrows():
    perf_plt.axes[0].text(index-.35, row.sensitivity+.015, round(row.sensitivity, 2), fontdict={'size': 14})
    perf_plt.axes[1].text(index-.35, row.specificity+.015, round(row.specificity, 2), fontdict={'size': 14})
    perf_plt.axes[2].text(index-.35, row.precision+.015, round(row.precision, 2), fontdict={'size': 14})
    perf_plt.axes[3].text(index-.35, row.accuracy+.015, round(row.accuracy, 2), fontdict={'size': 14})
Compared to other readmission prediction work, our classifiers perform reasonably well. One of the main challenges in this project is that readmission data is inherently imbalanced, which complicates both model design and model evaluation. Class weights can be adjusted in certain classifiers, such as random forests and neural networks; for logistic regression, we instead employed resampling techniques, namely undersampling and oversampling. Accuracy is clearly inappropriate here, so we also dug into each model's sensitivity, specificity, precision, and confusion matrix. Ensemble methods such as random forest and bagged logistic regression turned out to provide better performance than a single classifier.
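As a point of reference for the class-weight approach mentioned above, scikit-learn's `class_weight='balanced'` option weights each class by `n_samples / (n_classes * count_c)`, so errors on the minority (readmitted) class cost proportionally more. A minimal sketch of that computation (the helper name and the 9:1 toy labels are illustrative, not drawn from our data):

```python
import numpy as np

def balanced_class_weights(y):
    """Mimic sklearn's class_weight='balanced':
    weight_c = n_samples / (n_classes * count_c)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# illustrative 9:1 imbalance, similar in spirit to our readmission labels
y_demo = np.array([0] * 90 + [1] * 10)
w = balanced_class_weights(y_demo)
# minority-class errors are weighted 9x heavier than majority-class errors
```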